Interested in the next version of Pony Diffusion? Read the update here: https://civitai.com/articles/5069/towards-pony-diffusion-v7
You may have seen score_9, or its longer version "score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up", used in prompts for Pony Diffusion. Let's explore what this tag is, how it came to be, and how to use it correctly to generate better images.
Why
The (simplified) lifecycle of an AI model consists of two stages - Training and Inference.
During Training, a model that either doesn't know how to do anything (training from scratch) or doesn't have specific knowledge (finetuning) is repeatedly fed image-caption pairs, teaching it concepts that make sense from a human perspective. This is a long process, and for Pony Diffusion V6 it took about 3 months on very beefy hardware.
When the model is finished, we start using it to generate images; this is called Inference.
There are a number of challenges we need to overcome to be able to actually generate something nice.
Computers do not understand the concept of "nice", and images generated during Inference generally match (the quality of) the ones observed during Training (clever people call this GIGO: garbage in, garbage out).
An obvious (and unfortunately naïve) option is to train models only on good data. First of all, a lot of concepts (e.g. characters, objects, actions) may not have enough good data to help the model learn them. Secondly, we still don't know how to separate good data from bad data in our database of source images.
If we want a diverse model (in other words, a model that knows many characters and other concepts) we need to grab as much data as we can. But also, not too much, as with more data comes more training time, and more $$$ burned.
So we need a way to find the subset of good data in all the data available to us and then rank images within this dataset.
Teaching machines to know what is good
Luckily for us we have ways of educating machines on what is considered good looking by humans. There are many ways to do so, but PSAI is using something called "CLIP based aesthetic ranking".
CLIP, or Contrastive Language-Image Pre-training, is a way to pair matching images and captions. To put it simply, it's another AI model, trained on a large dataset of captions created by humans, that accepts both images and text and measures how well they match each other. In addition to learning about things like "dog" or "cat", CLIP also learns about concepts like "masterpiece", "best quality" or "hd" (as these words are common in image captions created by humans).
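The matching step can be sketched as a cosine similarity between an image embedding and a few caption embeddings. This is only an illustration: the 3-dimensional vectors below are made-up stand-ins for real CLIP embeddings (which have hundreds of dimensions), and the names `cosine_similarity` and `rank_captions` are hypothetical helpers, not part of any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, captions):
    """Return (similarity, caption) pairs, best match first."""
    scored = [(cosine_similarity(image_vec, vec), text)
              for text, vec in captions.items()]
    return sorted(scored, reverse=True)


# Toy embeddings, invented for this example (not produced by a real model).
image_embedding = [0.9, 0.1, 0.3]
caption_embeddings = {
    "a dog": [0.8, 0.2, 0.1],
    "a cat": [0.1, 0.9, 0.2],
    "masterpiece": [0.3, 0.1, 0.9],
}

best_score, best_caption = rank_captions(image_embedding, caption_embeddings)[0]
print(best_caption)
```

With a real model, both kinds of embedding would come from CLIP's image and text encoders; only the comparison step is shown here.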
If you have used Stable Diffusion with other models, you may have used such keywords/tags to improve the quality of your generations.
So why not just grab CLIP and use it everywhere if it's so good at measuring which images correspond to "masterpiece" and other similar tags? Well, we again have challenges to overcome.
CLIP has been trained on a little bit of everything and the quality of the captions used is somewhat limited, meaning CLIP is not very good at non-photorealistic and somewhat less popular content, like ponies or cartoon furry characters (it works much better on Anime).
But we can still use CLIP, because its internals contain plenty of signals that may not necessarily have a good name attached to them. If we can surface those signals, we can use them to separate good images from bad ones.
Enter data labeling hell
In order to implement our plan we still need a lot of good images (but also many not so good ones, and some very bad ones). How can we get some? Well, for one, we can look at the various scores/ranks assigned to them on popular boorus to pick some images.
At this point you may say - "Hey, wait a minute. You already have the scores! Just use them to pick good images!" and you will be partially right. Some models (including early Pony Diffusion ones) used such score metadata.
Unfortunately, using scores introduces two issues. First, users rate images based on both quality and content, and while the two are generally correlated, there are biases, like NSFW content being ranked higher or specific characters getting preferential treatment independently of quality. Second, these scores are affected by the age of the image and do not match between different sources of metadata (e.g. a score of 100 may be in the top 1% on one site while on another it's an average score).
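The cross-site mismatch is easy to see with a percentile rank. The sketch below uses two invented score distributions (`site_a_scores` and `site_b_scores` are hypothetical, not real booru data) and a hypothetical helper `percentile_rank`. Note that it only illustrates why raw scores are not comparable; it does nothing about the content and age biases described above.

```python
from bisect import bisect_left


def percentile_rank(score, all_scores):
    """Fraction of scores on the same site that are strictly lower than `score`."""
    ordered = sorted(all_scores)
    return bisect_left(ordered, score) / len(ordered)


# Invented distributions: site A rarely sees high scores, site B routinely does.
site_a_scores = [1, 2, 3, 5, 8, 10, 12, 20, 40, 100]
site_b_scores = [60, 80, 90, 95, 100, 105, 110, 120, 150, 300]

# The same raw score of 100 lands in very different places.
rank_on_a = percentile_rank(100, site_a_scores)
rank_on_b = percentile_rank(100, site_b_scores)
print(rank_on_a, rank_on_b)
```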
So, at least we can use the scores to pick a decent distribution of images. Now let's go over them and manually rank them in terms of quality; I personally decided on a 1 to 5 point range. Still, two questions remain: how many images do we need, and who will rank them?
We do need a lot of images, and we want a decent number of images of each "type": some 3d, some sketches, some semi-realistic, etc. Miss a style and the model will not learn how to judge it correctly.
In the case of V6 this number was ~20k manually labeled images. Now, we need someone who can look at images and use their art critique skills to judge them on the scale we invented. And who is that impartial person, unbiased and neutral, able to make decisions or judgments based on objective criteria rather than personal feelings, interests, or prejudices? It's me, obviously. So, after spending weeks in the data labeling cave methodically ranking each image, I was able to produce an aesthetic dataset large enough to be useful.
We can now train a new model that takes CLIP's image representation (which we call an embedding) and a human rating, and learns from them how to rank new images. We then run this model on the embedding of each and every picture we encounter and get a 1 to 5 rank (which is actually now a 0 to 1 rank, as computers like this range more).
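As a rough illustration of the idea, here is a minimal sketch that fits a plain linear model to (embedding, rating) pairs with gradient descent. The article does not say what architecture the actual ranking model uses, so the linear model, the 2-dimensional toy embeddings, and the names `rating_to_unit`, `train` and `predict` are all assumptions made for this example.

```python
def rating_to_unit(rating):
    """Map a 1..5 human rating onto the 0..1 range."""
    return (rating - 1) / 4


def predict(weights, bias, embedding):
    """Predicted quality for an embedding, clamped to 0..1."""
    raw = bias + sum(w * x for w, x in zip(weights, embedding))
    return min(1.0, max(0.0, raw))


def train(samples, epochs=2000, lr=0.1):
    """Fit a linear model with per-sample gradient descent on squared error."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for embedding, rating in samples:
            target = rating_to_unit(rating)
            raw = bias + sum(w * x for w, x in zip(weights, embedding))
            error = raw - target
            weights = [w - lr * error * x for w, x in zip(weights, embedding)]
            bias -= lr * error
    return weights, bias


# Toy (embedding, human rating) pairs, invented for this example.
labeled = [
    ([0.0, 0.0], 1),
    ([0.2, 0.3], 2),
    ([0.5, 0.5], 3),
    ([0.8, 0.7], 4),
    ([1.0, 1.0], 5),
]

weights, bias = train(labeled)
print(predict(weights, bias, [0.9, 0.9]), predict(weights, bias, [0.1, 0.1]))
```

Once trained, `predict` can be applied to the embedding of any unlabeled image, which is what makes ranking a whole dataset feasible after labeling only a small part of it by hand.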
We have now solved two big issues: we can use this new model to select only the set of images we want to train the new model on, and we can annotate their captions with a special tag.
So the best images get a caption like "score_9, a cute pony" and slightly less good images get "score_8, maybe not so cute pony".
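The annotation step might look like the sketch below. It assumes each tag covers one tenth of the 0 to 1 range, which matches the article's later remark that score_8 covers 80% to 90%; how a perfect 1.0 is handled, and the function names `score_tag` and `annotate`, are assumptions for illustration.

```python
def score_tag(quality):
    """Map a 0..1 quality score to a score_N tag, one tag per tenth.

    A perfect 1.0 is folded into score_9 (an assumption; the article only
    says that score_8 covers 80% to 90%).
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    bucket = min(9, int(quality * 10))
    return f"score_{bucket}"


def annotate(caption, quality):
    """Prefix a caption with its score tag."""
    return f"{score_tag(quality)}, {caption}"


print(annotate("a cute pony", 0.93))
print(annotate("maybe not so cute pony", 0.85))
```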
It's Training time
We now have annotated data and can finally train the actual Pony Diffusion. Let's keep showing the model images and our captions containing the score tags so it learns which of the score tags corresponds to good images, giving us a more controllable version of "masterpiece" and friends.
But wait, it turned out I messed up a bit! What I described above is how PD V5.X used to do things. In V6 I also wanted to be able to say - "hey, give me anything 80% good and up". But the score_8 tag would only give us images in the 80% to 90% range. Perhaps using both score_8 and score_9 would work, but I wanted to verify that, so I changed the labels from the simple score_9 to something more verbose like "score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up", and score_8 to "score_8, score_7_up, score_6_up, score_5_up, score_4_up". In reality I exposed myself to a variation of the Clever Hans effect, where the model learned that the whole long string correlates with "good looking" images, instead of learning its separate parts. Unfortunately, by the time I realized it, we were way past the midpoint of training, so I just rolled with it (I did try to use shorter tags after the discovery, but due to the way we train it didn't have as strong an effect).
I will fix this in V7.
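The verbose V6 labels follow a simple pattern that can be reproduced in a few lines. The sketch below matches the two strings quoted in the article (for score_9 and score_8); what the labels looked like for lower buckets is not stated, so extending the same pattern down to score_4 is an assumption, as is the function name `v6_score_tags`.

```python
LOWEST_BUCKET = 4  # the article notes that the tags only go as low as 4


def v6_score_tags(bucket):
    """Build the verbose V6-style tag string for a given score bucket."""
    if not LOWEST_BUCKET <= bucket <= 9:
        raise ValueError("bucket must be between 4 and 9")
    tags = [f"score_{bucket}"]
    # Every lower bucket gets an "_up" tag, from bucket - 1 down to 4.
    tags += [f"score_{n}_up" for n in range(bucket - 1, LOWEST_BUCKET - 1, -1)]
    return ", ".join(tags)


print(v6_score_tags(9))
print(v6_score_tags(8))
```

Because these tags always appeared together in this fixed order during training, the model had little reason to learn what any single part means on its own.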
tldr: we used a model trained on human preferences to label all data with special tags and then trained a text-to-image model on these labels, allowing us to ask the model for "good" images by using these tags.
Do I need to care?
Maybe... in some cases.
Some tools, like the PSAI Discord bot, add the score_9, ... tags automatically, but in other UIs you will need to add the long version yourself. In some UIs (Auto1111) you can save it as a style; in others, just copy and paste it at the beginning of all your prompts.
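If you build prompts in a script rather than a UI, the same habit can be automated. This is a minimal sketch with a hypothetical helper, `with_quality_prefix`, and is not taken from any particular tool.

```python
QUALITY_PREFIX = (
    "score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up"
)


def with_quality_prefix(prompt, prefix=QUALITY_PREFIX):
    """Put the long score string at the start of a prompt, without duplicating it."""
    prompt = prompt.strip()
    if prompt.startswith(prefix):
        return prompt
    return f"{prefix}, {prompt}" if prompt else prefix


print(with_quality_prefix("a cute pony, smiling"))
```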
The score tags have some bias in them, so if you are using style/artist LoRAs it sometimes makes sense to exclude the tags and see how the model reacts. I don't have a good recommendation on using them in training, as I don't make LoRAs myself.
Oh, and one more important note - these tags are less useful in negatives, as you can only go as low as 4. Ideally we should train on quality all the way down to 1, but that makes the training significantly more expensive. So you can still use score_4/5 in negatives, but they will not push you away from really bad images.